We are analysing a dataset from NASA’s Goddard Institute for Space Studies to study the effects of climate change in the Northern Hemisphere. A glimpse at the data shows 19 variables and 143 observations, covering the period 1880-2022:
weather <-
read_csv("https://data.giss.nasa.gov/gistemp/tabledata_v4/NH.Ts+dSST.csv",
skip = 1,
na = "***")
glimpse(weather)
## Rows: 143
## Columns: 19
## $ Year <dbl> 1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887, 1888, 1889, 1890…
## $ Jan <dbl> -0.36, -0.30, 0.26, -0.58, -0.16, -1.01, -0.75, -1.08, -0.49, -0…
## $ Feb <dbl> -0.50, -0.21, 0.21, -0.66, -0.11, -0.45, -0.84, -0.71, -0.61, 0.…
## $ Mar <dbl> -0.23, -0.03, 0.02, -0.15, -0.64, -0.23, -0.71, -0.44, -0.64, -0…
## $ Apr <dbl> -0.29, 0.01, -0.30, -0.30, -0.59, -0.49, -0.37, -0.38, -0.22, 0.…
## $ May <dbl> -0.06, 0.05, -0.23, -0.25, -0.36, -0.58, -0.34, -0.26, -0.15, -0…
## $ Jun <dbl> -0.16, -0.32, -0.28, -0.11, -0.41, -0.45, -0.37, -0.20, -0.03, -…
## $ Jul <dbl> -0.17, 0.09, -0.28, -0.05, -0.41, -0.33, -0.14, -0.24, 0.00, -0.…
## $ Aug <dbl> -0.25, -0.03, -0.14, -0.22, -0.51, -0.41, -0.43, -0.55, -0.21, -…
## $ Sep <dbl> -0.22, -0.26, -0.24, -0.34, -0.45, -0.40, -0.33, -0.21, -0.20, -…
## $ Oct <dbl> -0.30, -0.43, -0.51, -0.16, -0.44, -0.37, -0.31, -0.49, -0.03, -…
## $ Nov <dbl> -0.41, -0.36, -0.33, -0.44, -0.57, -0.38, -0.40, -0.27, -0.01, -…
## $ Dec <dbl> -0.40, -0.23, -0.68, -0.15, -0.47, -0.11, -0.22, -0.43, -0.24, -…
## $ `J-D` <dbl> -0.28, -0.17, -0.21, -0.28, -0.43, -0.43, -0.43, -0.44, -0.24, -…
## $ `D-N` <dbl> NA, -0.18, -0.17, -0.33, -0.40, -0.46, -0.43, -0.42, -0.25, -0.1…
## $ DJF <dbl> NA, -0.30, 0.08, -0.64, -0.14, -0.64, -0.57, -0.67, -0.51, -0.08…
## $ MAM <dbl> -0.19, 0.01, -0.17, -0.24, -0.53, -0.43, -0.47, -0.36, -0.34, 0.…
## $ JJA <dbl> -0.19, -0.09, -0.23, -0.13, -0.44, -0.40, -0.31, -0.33, -0.08, -…
## $ SON <dbl> -0.31, -0.35, -0.36, -0.31, -0.48, -0.38, -0.35, -0.32, -0.08, -…
For the purposes of our analysis, we select only the columns containing the monthly temperature deviations (deltas) and reshape the data frame to facilitate further investigation:
tidyweather <- select(weather, 1:13) %>%
pivot_longer(!Year, names_to = 'month', values_to = 'delta')
head(tidyweather)
## # A tibble: 6 × 3
## Year month delta
## <dbl> <chr> <dbl>
## 1 1880 Jan -0.36
## 2 1880 Feb -0.5
## 3 1880 Mar -0.23
## 4 1880 Apr -0.29
## 5 1880 May -0.06
## 6 1880 Jun -0.16
First, we are plotting a scatter plot to visualize the evolution of delta (temperature deviation) over time:
tidyweather <- tidyweather %>%
mutate(date = ymd(paste(as.character(Year), month, "1")),
month = month(date, label=TRUE),
year = year(date))
ggplot(tidyweather, aes(x=date, y = delta))+
geom_point()+
geom_smooth(color="red") +
theme_bw() +
labs (
title = "Weather Anomalies"
)
Adding a line of best fit to the scatterplot clearly shows that, while deltas were negative before roughly 1935 and close to 0 between approximately 1935-1975, temperatures in the Northern Hemisphere have been rising quickly since then. Notice that the rate of increase has itself been accelerating, as indicated by the steepening trend line.
Next, we will use facet_wrap() to visualize the effects of increasing temperature by month:
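The chunk that produced the faceted figure is not shown above; a minimal sketch, assuming the `tidyweather` data frame created earlier, could look like this:

```r
# scatterplot of delta over time, faceted into one panel per month
ggplot(tidyweather, aes(x = year, y = delta)) +
  geom_point(size = 0.5) +
  geom_smooth(color = "red") +
  facet_wrap(~month) +
  theme_bw() +
  labs(title = "Weather Anomalies by Month")
```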
We can see that the effect of rising temperature in the Northern Hemisphere is common to all months of the year.
As a means to further investigate the effects of climate change, we will partition the data into longer time periods, specifically multi-decade intervals. To that end, we will use case_when():
comparison <- tidyweather %>%
filter(Year>= 1881) %>% #remove years prior to 1881
#create new variable 'interval', and assign values based on criteria below:
mutate(interval = case_when(
Year %in% c(1881:1920) ~ "1881-1920",
Year %in% c(1921:1950) ~ "1921-1950",
Year %in% c(1951:1980) ~ "1951-1980",
Year %in% c(1981:2010) ~ "1981-2010",
TRUE ~ "2011-present"
))
To study how the distribution of temperature deviations has shifted over time, we produce a density plot of monthly deltas for each of the intervals defined above:
ggplot(comparison) +
aes(x = delta, fill = interval)+
#facet_wrap(~month)+
geom_density(alpha = 0.2)
The plot clearly shows that, with the passage of time, the distribution of deltas has shifted to the right. In other words, temperature deviations have been increasing over time.
Lastly, we will also consider annual anomalies by grouping the monthly data and producing a scatterplot:
#creating yearly averages
average_annual_anomaly <- tidyweather %>%
group_by(Year) %>% #grouping data by Year
# creating summaries for mean delta
# use `na.rm=TRUE` to eliminate NA (not available) values
summarise(yearly_mean = mean(delta, na.rm=TRUE))
average_annual_anomaly## # A tibble: 143 × 2
## Year yearly_mean
## <dbl> <dbl>
## 1 1880 -0.279
## 2 1881 -0.168
## 3 1882 -0.208
## 4 1883 -0.284
## 5 1884 -0.427
## 6 1885 -0.434
## 7 1886 -0.434
## 8 1887 -0.438
## 9 1888 -0.236
## 10 1889 -0.178
## # … with 133 more rows
#plotting the data
#Fit a linear trend line (method = 'lm')
ggplot(average_annual_anomaly) +
aes(x = Year, y = yearly_mean)+
geom_point()+
geom_smooth(method = 'lm') +
theme_bw()
The plot shows that annual temperature deltas have been increasing over time.
We will now focus on the period from 2011 to the present. We divert our attention to producing a confidence interval for the average annual deltas observed since 2011, using both the formula-based statistical method and a bootstrap simulation to have higher confidence in the results:
#statistical method
formula_ci <- comparison %>%
filter(interval == '2011-present') %>%
group_by(year) %>%
summarise(avg_annual_delta = mean(delta),
sd_delta = sd(delta),
count = n(),
SE = sd_delta/sqrt(count),
upper_ci = avg_annual_delta + 2*SE,
lower_ci = avg_annual_delta - 2*SE)
#bootstrap simulation
formula_ci_2 <- comparison %>%
filter(interval == '2011-present') %>%
specify(response = delta) %>%
generate(reps = 1000, type = 'bootstrap') %>%
calculate(stat = 'mean') %>%
get_confidence_interval(level = 0.95)
#print out formula_CI
formula_ci
## # A tibble: 12 × 7
## year avg_annual_delta sd_delta count SE upper_ci lower_ci
## <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 2011 0.745 0.113 12 0.0327 0.810 0.680
## 2 2012 0.813 0.180 12 0.0520 0.917 0.709
## 3 2013 0.8 0.118 12 0.0340 0.868 0.732
## 4 2014 0.919 0.146 12 0.0421 1.00 0.835
## 5 2015 1.18 0.178 12 0.0515 1.28 1.07
## 6 2016 1.31 0.332 12 0.0958 1.50 1.11
## 7 2017 1.18 0.227 12 0.0655 1.31 1.05
## 8 2018 1.03 0.137 12 0.0395 1.11 0.956
## 9 2019 1.21 0.153 12 0.0441 1.30 1.12
## 10 2020 1.35 0.225 12 0.0648 1.48 1.22
## 11 2021 1.14 0.115 12 0.0332 1.20 1.07
## 12 2022 NA NA 12 NA NA NA
formula_ci_2
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 1.04 1.04
Looking at the results of the analysis, we see that the formula-based method produces fairly wide confidence intervals for the annual temperature deltas, ranging from about 0.13 to 0.39 in width. This is likely due to the small number of observations (only 12 months per year), which limits the precision of each annual estimate. The bootstrap simulation, by contrast, pools all monthly observations from 2011 onwards into a single sample and therefore yields a much narrower interval. Both methods show that temperature deltas have been positive throughout the period in question and consistently greater than 1 since 2015.
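As a rough cross-check of the bootstrap result, the same percentile interval can be computed with base R alone (a sketch, assuming the `comparison` data frame from above; the number of replicates and the seed are arbitrary choices):

```r
set.seed(1234)  # arbitrary seed, for reproducibility
# pull the monthly deltas for 2011-present, dropping missing months
recent_deltas <- comparison %>%
  filter(interval == "2011-present", !is.na(delta)) %>%
  pull(delta)
# resample the deltas with replacement 1000 times, taking the mean of each resample
boot_means <- replicate(1000, mean(sample(recent_deltas, replace = TRUE)))
# 95% percentile confidence interval for the mean delta
quantile(boot_means, probs = c(0.025, 0.975))
```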
In this section, we will analyse the evolution of the approval margins of US President Joe Biden. A glimpse at the dataset shows 22 variables and 4,495 observations:
# Import approval polls data directly off fivethirtyeight website
approval_polllist <- read_csv('https://projects.fivethirtyeight.com/biden-approval-data/approval_polllist.csv')
glimpse(approval_polllist)
## Rows: 4,495
## Columns: 22
## $ president <chr> "Joe Biden", "Joe Biden", "Joe Biden", "Joe Biden"…
## $ subgroup <chr> "All polls", "All polls", "All polls", "All polls"…
## $ modeldate <chr> "9/9/2022", "9/9/2022", "9/9/2022", "9/9/2022", "9…
## $ startdate <chr> "1/19/2021", "1/19/2021", "1/20/2021", "1/20/2021"…
## $ enddate <chr> "1/21/2021", "1/21/2021", "1/21/2021", "1/21/2021"…
## $ pollster <chr> "Rasmussen Reports/Pulse Opinion Research", "Morni…
## $ grade <chr> "B", "B", "B", "B+", "B", "B-", "B+", "B-", "B", "…
## $ samplesize <dbl> 1500, 15000, 1993, 1516, 15000, 1115, 941, 1200, 1…
## $ population <chr> "lv", "a", "rv", "a", "a", "a", "rv", "rv", "a", "…
## $ weight <dbl> 0.3382, 0.2594, 0.0930, 1.2454, 0.2333, 1.1014, 1.…
## $ influence <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ approve <dbl> 48.0, 50.0, 56.0, 45.0, 51.0, 55.5, 63.0, 58.0, 52…
## $ disapprove <dbl> 45.0, 28.0, 31.0, 28.0, 28.0, 31.6, 37.0, 32.0, 29…
## $ adjusted_approve <dbl> 49.2, 49.4, 55.4, 46.0, 50.4, 54.6, 59.5, 57.5, 51…
## $ adjusted_disapprove <dbl> 40.3, 30.9, 33.9, 29.0, 30.9, 32.5, 38.4, 32.8, 31…
## $ multiversions <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ tracking <lgl> TRUE, TRUE, NA, NA, TRUE, NA, NA, NA, TRUE, TRUE, …
## $ url <chr> "https://www.rasmussenreports.com/public_content/p…
## $ poll_id <dbl> 74247, 74272, 74246, 74327, 74273, 74248, 74256, 7…
## $ question_id <dbl> 139395, 139491, 139394, 139570, 139492, 139404, 13…
## $ createddate <chr> "1/22/2021", "1/28/2021", "1/22/2021", "2/2/2021",…
## $ timestamp <chr> "09:48:31 9 Sep 2022", "09:48:31 9 Sep 2022", "0…
We will first calculate the net approval rate (approve - disapprove) for each week in 2022 along with its 95% confidence interval, and then plot the results as a line plot, grouping by respondent subgroup (Adults, Voters, All polls).
fixed_dates <- approval_polllist %>%
mutate(date = mdy(enddate),
weeks = week(date),
year = year(date),
net_rate = approve - disapprove) %>%
filter(year == 2022, weeks<50) %>%
group_by(subgroup , weeks) %>%
# we calculated the 95% confidence interval.
summarise(mean_rate = mean(net_rate,na.rm=TRUE),
sd_rate = sd(net_rate,na.rm=TRUE),
number = n(),
t_critical = qt(0.975,number-1),
lower_bound = mean_rate - t_critical*sd_rate/ sqrt(number),
upper_bound = mean_rate + t_critical*sd_rate/ sqrt(number))
# we draw the graph of the net approval rate changing over weeks with its confidence interval.
ggplot(fixed_dates, aes(x = weeks, y = mean_rate, color = subgroup)) +
geom_ribbon(aes(ymin = lower_bound, ymax = upper_bound),
fill = "orange",
alpha = 0.25,
show.legend = FALSE) +
facet_grid(rows = vars(subgroup)) +
geom_line(aes(x = weeks, y = mean_rate,
color = subgroup),
show.legend = FALSE) +
labs(title = "Biden's net approval ratings in 2022",
subtitle = "Weekly Data, Approve - Disapprove, %",
caption = "Source: https://projects.fivethirtyeight.com/biden-approval-data/",
x = NULL,
y = NULL) +
theme_minimal()
We can see that President Biden’s net approval rating has remained negative since the first week of 2022 among all poll subgroups, meaning more people disapprove than approve of the President. Furthermore, we notice a sharp decline in the net approval rate beginning in week 23, after which the rating recovered to its pre-drop levels, potentially due to the POTUS’s recent delivery on several campaign promises, most notably the passage of the Inflation Reduction Act.
This section focuses on analysing daily rental data from the Tfl bike sharing scheme.
url <- "https://data.london.gov.uk/download/number-bicycle-hires/ac29363e-e0cb-47cc-a97a-e216d900a6b0/tfl-daily-cycle-hires.xlsx"
# Download TFL data to temporary file
httr::GET(url, write_disk(bike.temp <- tempfile(fileext = ".xlsx")))
## Response [https://airdrive-secure.s3-eu-west-1.amazonaws.com/london/dataset/number-bicycle-hires/2022-09-06T12%3A41%3A48/tfl-daily-cycle-hires.xlsx?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJJDIMAIVZJDICKHA%2F20220910%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20220910T183325Z&X-Amz-Expires=300&X-Amz-Signature=08af6c1aef1cc7da37d0fdc7da3ca949e7f7d76703bb4b2a7efd5ed7bd6ca6d5&X-Amz-SignedHeaders=host]
## Date: 2022-09-10 18:35
## Status: 200
## Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
## Size: 180 kB
## <ON DISK> /var/folders/1j/4n8_k36148vb884h5ph4r83h0000gn/T//RtmpRbiTwY/file7adb7c015a0.xlsx
# Use read_excel to read it as dataframe
bike0 <- read_excel(bike.temp,
sheet = "Data",
range = cell_cols("A:B"))
# change dates to get year, month, and week
bike <- bike0 %>%
clean_names() %>%
rename (bikes_hired = number_of_bicycle_hires) %>%
mutate (year = as.integer(year(day)),
month = lubridate::month(day, label = TRUE),
week = isoweek(day))
glimpse(bike)
## Rows: 4,416
## Columns: 5
## $ day <dttm> 2010-07-30, 2010-07-31, 2010-08-01, 2010-08-02, 2010-08-0…
## $ bikes_hired <dbl> 6897, 5564, 4303, 6642, 7966, 7893, 8724, 9797, 6631, 7864…
## $ year <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010…
## $ month <ord> Jul, Jul, Aug, Aug, Aug, Aug, Aug, Aug, Aug, Aug, Aug, Aug…
## $ week <dbl> 30, 30, 30, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32…
First, we calculate the monthly change in Tfl bike rentals, defined as the difference between the actual monthly average and the historical average for 2016-2019. We plot the data faceted by year, using geom_ribbon() to visualize the positive/negative deltas.
#calculating expected number of rentals
compare <- bike %>%
filter(year %in% c(2016:2019)) %>%
group_by(month) %>%
summarise(compare_avg = mean(bikes_hired))
#calculating monthly averages
avg <- bike %>%
filter(year %in% 2017:2022) %>%
group_by(year, month) %>%
summarise(actual_avg = mean(bikes_hired))
#joining datasets
left_join(avg, compare, by = 'month') %>%
#calculating differences
mutate(difference = actual_avg - compare_avg,
pos_diff = ifelse(difference > 0, actual_avg, 0),
neg_diff = ifelse(difference < 0, compare_avg, 0)) %>%
#plotting
ggplot(aes(x = month)) +
geom_line(aes(y = compare_avg, group = 1), color = "blue", lwd = 1.5) +
geom_line(aes(y = actual_avg, group = 1)) +
geom_ribbon(aes(ymin = compare_avg, ymax = pmax(0, difference) + compare_avg, group = 1), fill = "red", alpha = 0.5) +
geom_ribbon(aes(ymin = pmin(0, difference) + compare_avg, ymax = compare_avg, group = 1), fill = "green", alpha = 0.5) +
facet_wrap(vars(year)) +
labs(title = "Monthly changes in Tfl bike rentals",
subtitle = "Change from monthly average shown in blue and calculated between 2016-2019",
caption = "Source: Tfl, London Data Store",
x = NULL,
y = "Bike rentals") +
theme(legend.position = "none")
We can see that Tfl bike rentals fell below the 2016-2019 average at the beginning of the pandemic, but quickly recovered and exceeded historical levels. Interestingly, there has been a sizable uptick starting in the second half of 2021, possibly due to changing preferences regarding means of transport, with public transport losing users to Tfl bikes.
Next, we plot a similar graph to visualize weekly changes in Tfl bike rentals between actual data and the 2016-2019 average.
#calculating expected number of rentals
compare <- bike %>%
filter(year %in% c(2016:2019)) %>%
group_by(week) %>%
summarise(compare_avg = mean(bikes_hired))
#calculating weekly averages
avg <- bike %>%
filter(year %in% 2017:2022) %>%
group_by(year, week) %>%
summarise(actual_avg = mean(bikes_hired))
#deleting aberrant observations (average for future weeks in 2022)
avg <- avg[-298,]
#joining dataframes
left_join(avg, compare, by = 'week') %>%
#calculating differences
mutate(diff = (actual_avg - compare_avg)/compare_avg,
pos_diff = ifelse(diff > 0, diff, 0),
neg_diff = ifelse(diff < 0, diff, 0)) %>%
#plotting
ggplot(aes(x = week, y = diff)) +
scale_x_continuous(breaks = c(13, 26, 39, 53)) +
scale_y_continuous(labels = percent) +
geom_rect(aes(xmin = 13, xmax = 26, ymin = -Inf, ymax = Inf), alpha = 0.3, fill = "grey90") +
geom_rect(aes(xmin = 39, xmax = 53, ymin = -Inf, ymax = Inf), alpha = 0.3, fill = "grey90") +
geom_line(aes(y = diff, group = 1), color = 'black', lwd = 0.8) +
geom_ribbon(aes(ymin = 0, ymax = pmax(0, pos_diff)), fill = 'green', alpha = 0.3) +
geom_ribbon(aes(ymin = pmin(0, neg_diff), ymax = 0), fill = 'red', alpha = 0.3) +
geom_rug(aes(colour = diff),
sides = 'b',
length = unit(0.02, "npc"),
size = 1,
show.legend = FALSE) +
binned_scale(aesthetics = "colour",
scale_name = "stepsn",
palette = function(x) c("red", "green"),
breaks = c(0, 100)) +
facet_wrap(vars(year)) +
theme_minimal() +
labs(title = "Weekly changes in Tfl bike rentals",
subtitle = "% change from weekly averages calculated between 2016-2019",
caption = "Source: Tfl, London Data Store",
x = "Week",
y = NULL)
Again, one can easily notice the drops at the beginning of 2020 (the start of the pandemic) and in the winter of 2021 (a COVID wave), as well as the sizable increase in Tfl rentals since the second half of 2021.
It should be noted that the mean has been used to calculate the expected number of bike rentals for each month/week because the data appears approximately normally distributed, as seen in the histogram below. Had the distribution been heavily skewed, the median would have been preferable, as it is a more robust measure of central tendency.
hist(bike$bikes_hired)
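As a quick numerical check of that claim, the mean and median of daily hires can be compared directly; values that are close together suggest limited skew (a sketch, assuming the `bike` data frame from above):

```r
# if the distribution were heavily skewed, these two values would differ noticeably
c(mean   = mean(bike$bikes_hired, na.rm = TRUE),
  median = median(bike$bikes_hired, na.rm = TRUE))
```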